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Abstract 

Background: Integrase inhibitors (INI) form a new drug class in the treatment of HIV-1 patients. We developed a 
linear regression modeling approach to make a quantitative raltegravir (RAL) resistance phenotype prediction, as 
Fold Change in IC50 against a wild type virus, from mutations in the integrase genotype. 

Methods: We developed a clonal genotype-phenotype database with 991 clones from 153 clinical isolates of INI 
naive and RAL treated patients, and 28 site-directed mutants. 

We did the development of the RAL linear regression model in two stages, employing a genetic algorithm (GA) to 
select integrase mutations by consensus. First, we ran multiple GAs to generate first order linear regression models 
(GA models) that were stochastically optimized to reach a goal R 2 accuracy, and consisted of a fixed-length subset 
of integrase mutations to estimate INI resistance. Secondly, we derived a consensus linear regression model in a 
forward stepwise regression procedure, considering integrase mutations or mutation pairs by descending 
prevalence in the GA models. 

Results: The most frequently occurring mutations in the GA models were 92Q, 97A, 143R and 155H (all 100%), 
143G (90%), 148H/R (89%), 148K (88%), 1511 (81%), 121 Y (75%), 143C (72%), and 74M (69%). The RAL second order 
model contained 30 single mutations and five mutation pairs (p < 0.01 ): 143C/R&97A, 155H&97A/151I and 
74M&151I. The R 2 performance of this model on the clonal training data was 0.97, and 0.78 on an unseen 
population genotype-phenotype dataset of 171 clinical isolates from RAL treated and INI naive patients. 

Conclusions: We describe a systematic approach to derive a model for predicting INI resistance from a limited 
amount of clonal samples. Our RAL second order model is made available as an Additional file for calculating a 
resistance phenotype as the sum of integrase mutations and mutation pairs. 
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Background 

Recently, new drugs have been developed for the treat- 
ment of HIV-1 patients that act at different steps in the 
viral replication cycle [1,2]. Integrase inhibitors (INIs) 
target HIV-1 integrase, an enzyme which mediates the 
integration of HIV-1 viral DNA into the host genome 
[3,4]. Raltegravir is the first INI approved by the FDA, 
for use in treatment-naive and treatment-experienced 
patients [5,6]. Elvitegravir and S/GSK1349572 are two 
other INIs in advanced clinical development [7]. 
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Notwithstanding the success of antiretroviral treat- 
ment of HIV-1 infection, viral replication cannot always 
be completely inhibited and this results in the emer- 
gence of drug resistance. In clinical practice, resistance 
testing has proven to be beneficial in designing potent 
combination regimens. Genotypic tests are preferred to 
phenotypic tests because of lower cost and faster turn- 
around time. However, phenotypic tests can provide use- 
ful additional information, especially for more complex 
mutational patterns [8,9]. In this respect, linear regres- 
sion is successfully applied as a diagnostic service for 
clinicians, by modeling drug susceptibility as a function 
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of the mutations in the patients viral genome regions 
that encode for the enzymes HIV-1 protease and reverse 
transcriptase [10]. 

In this article, we describe our approach to also generate 
linear regression models to predict INI resistance from 
mutations in the integrase (IN) genetic region [11,12]. We 
show how we applied the methodology for raltegravir 
(RAL) in deriving a first and second order model on an in- 
house developed clonal genotype-phenotype database. We 
report on the performance of both RAL models on four dif- 
ferent datasets available for analysis: the two datasets that 
we used during model development - the clonal database 
(training set), and an external set of site-directed mutants 
that we used for evaluation of mutation pairs for our 
second order model (validation set) - and two population 
datasets of clinical isolates: the dataset with samples from 
which we derived the clones (seen data), and an indepen- 
dent test set (unseen data). 

Our results indicated that RAL resistance could be 
accurately predicted using linear regression modeling. 

Methods 

Clonal INI genotype-phenotype database construction 

We derived the Virco clonal INI genotype-phenotype data- 
base from 153 clinical isolates, originating from INI naive 
and RAL treated patients, including 106 HIV-1 infected 
patients previously described [13]. Plasma samples were 
collected before and/or during RAL treatment. 

The production of the population recombinant viruses 
was done as previously described [13]. Briefly, RNA is 
extracted from plasma and the IN gene is amplified. The 
replication-competent recombinant virus stocks were pro- 
duced via homologous recombination in MT4 cells. The 
purified IN amplicons were recombined within the cells 
with the pHXB2-AIN backbone by Amaxa nucleofection. 
The cell cultures were microscopically monitored for the 
appearance of cytopathic effect during the course of infec- 
tion. When full cytopathic effect was reached, the superna- 
tants containing the recombinant viruses were harvested by 
centrifugation. For the production of the clonal recombi- 
nant viruses, the purified IN amplicons were cloned into 
the backbone pHXB2-DIN-eGFP using the Clontech In- 
Fusion technology, following the manufacturers protocol. 
The recombinant plasmids were transformed into Max Effi- 
ciency Stbl2 cells (Invitrogen) using the manufacturer's pro- 
cedure. Individual clones were randomly picked and 
cultured to prepare full-length vector HIV-1 genome DNA 
using the QiaPrep Spin Miniprep system (Qiagen). 
Replication-competent recombinant virus stocks were gen- 
erated by nucleofection of full-length HIV-genome plas- 
mids into MT4 cells (Amaxa Biosystems, Cologne, 
Germany). The cell cultures were microscopically moni- 
tored for the appearance of cytopathic effect during the 
course of infection. When full cytopathic effect was 



reached, the supernatants containing the recombinant 
viruses were harvested by centrifugation. 

The recombinant viruses were titrated and subjected 
to an antiviral experiment in MT4-LTR-eGFP cells as 
previously described [13]. Fold change (FC) values were 
calculated, using the HIV-1 wild- type strain IIIB as a 
reference. 

Sequence analysis was also done as previously 
described [13]. Genotypes were defined as a list of IN 
mutations compared to the HIV-1 wild-type strain 
HXB2. 

In total, our INI genotype-phenotype clonal database 
consisted for RAL of 991 clonal viruses: 899 clones 
derived from 153 clinical isolates (93.7% clade B, 6.3% 
clade non-B), 4 pHXB2D clones and 88 clones derived 
from 28 site-directed mutants, with a minimum of 2 
clones per site-directed mutant. The site-directed 
mutants incorporated in the clonal database were the 
ones described in [13]: 66A, 661, 92Q, 143R, 147G, 
148R, 155H, 92Q + 147G, 92Q + 155H, 140S + 148H and 
721 + 92Q + 157Q. In addition, site-directed mutants 
were constructed for IN mutations with score > 0 for 
RAL/elvitegravir(EVG) in the Stanford algorithm 6.0.11 
(http://hivdb.stanford.edu) and either absent in patient 
derived clones: 66K, 92V, 114Y, 121Y, 125K, 128T, 140C, 
143H, 145S, 146P, 151A, 153Y, 155S and 263K or under- 
represented: 51Y (1 clone) and 143C (11 clones). Mu- 
tation 72A was not found in any of the patient derived 
clones and it does not appear in the Stanford database 
of INI resistance mutations (http://hivdb.stanford.edu/ 
DR/INIResiNote.html). Therefore a site-directed mutant, 
which had been previously created and in vitro had FCs of 
1.71 and 4.85 for RAL and EVG, respectively was included 
in our database. By picking on average 6 clones for each of 
the 153 clinical isolates and including site-directed mutants, 
the IN database consisted of 433 unique clonal genotypes. 

We calculated a biological cutoff for RAL [14] for the 
clonal database as the 97.5 percentile of the log FC phe- 
notypes of patient-derived clonal viruses not containing 
any of the mutations listed in the RAL product label 
[15]: 74M, 92Q, 97A, 138A/K, 140A/S, 143C/H/R, 
148H/K/R, 1511, 155H, 163R, 183P, 226C/D/F/H, 230R 
and 232N, and/or not containing mutations with score > 
0 for RAL/EVG in the Stanford algorithm 6.0.11. Before 
calculating the biological cutoff, we removed outliers 
(log FC > mean log FC + 3 standard deviations). 

Consensus linear regression modeling for INI 

To perform linear regression on our clonal genotype- 
phenotype database, we first encoded the clonal genotypes 
as 0/1 (absence/presence) for all IN mutations present at 
least once in the database. 

We then used a two-stage genetic algorithm (GA) con- 
sensus approach to derive a linear regression model for 
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calculating INI resistance (log FC) as the sum of IN 
mutations or mutation pairs. In stage 1, we ran multiple 
GA searches to find first order regression models with 
R >goal R (GA solutions). In stage 2, we used a step- 
wise regression procedure to generate a first order/second 
order consensus model by considering IN mutations or 
mutation pairs for entry by descending prevalence in these 
GA solutions. 

Stage 1: Run multiple GAs to select and rank IN 
mutations 

In concept, a GA [16,17] is a computational search proce- 
dure where a randomly initialized set (population) of 
encoded genotypes (chromosomes) is evolved over several 
generations by optimization of the quality (fitness) of the 
chromosomes, and applying genetic operators (mutation and 
crossover). The GA search is successful once a chromosome 
with fitness > goal fitness (GA solution) is found. 

In our application, in search for an INI resistance linear 
regression model with R 2 > goal R 2 , a chromosome was a 
fixed-length subset of IN mutations. The fitness of a 
chromosome was evaluated by calculating the R 2 of the li- 
near model. The implementation of the genetic operators 
was as follows. The mutation genetic operator randomly 
replaced an IN mutation used as linear model parameter 
by another IN mutation. The crossover genetic operator 
randomly combined two chromosomes present within the 
population. In generating a new population, the principle 
of natural selection applied: IN mutations present in chro- 
mosomes that were more fit (higher R ) had more chance 
to be selected in a chromosome in the next generation. To 
avoid overfitting, we chose the different GA parameter 
settings such that a chromosome reached the goal fitness 
within a limited number of generations. As we ran mul- 
tiple GAs, we could make a ranking of IN mutations based 
on their prevalence (high to low) in the different GA 
solutions. 

For RAL, we performed multiple GA runs until 100 
solutions were obtained for making a GA ranking. The 
GAs were run using the R package GALGO [18] with 
the following settings: population size = 20, chromosome 
size = 30, maximum number of generations = 500, goal 
fitness = 0.95, mutation probability = 0.05 and crossover 
probability = 0.70. 

Stage 2: Run stepwise regression to derive a GA 
consensus first order/second order model 

We derived a consensus first order linear regression model 
by means of forward stepwise regression, considering IN 
mutations in order of the GA ranking, and using Schwarz 
Bayesian Criterion (SBC) for selection. The stepwise pro- 
cedure ended when SBC reached a minimum [19]. In build- 
ing the RAL consensus first order linear regression model, 



we considered mutations that were consistently selected 
(> 10% prevalence in the GA solutions). 

To account for synergistic and antagonistic effects be- 
tween mutations, we allowed mutation pairs (second 
order interaction terms) of which both mutations in the 
pair were present in more than T% of the GA models 
for entry in the model. A threshold of T = 100% corre- 
sponded with a first order linear regression model, while 
lowering T allowed for more interaction terms. For RAL, 
we chose the threshold T to maximize the R perfor- 
mance on a public geno/pheno set of 67 IN site-directed 
mutants, available from Stanford (http://hivdb.stanford. 
edu/cgi-bin/IN_Phenotype.cgi), contributed by the fol- 
lowing sources: [20] (11 isolates), [21] (14 isolates), [22] 
(18 isolates), [23] (10 isolates) and [24] (14 isolates). Phe- 
notyping of the isolates in this external geno/pheno set 
had been done with the PhenoSense assay (Monogram, 
South San Francisco), providing for validation of the in- 
house Virco assay. In the stepwise selection procedure, 
we kept IN mutations as first order terms in the model 
when also present in a mutation pair. 

Performance evaluation of RAL linear regression model 

We analyzed the R 2 performance on the clonal database 
(training set), on the external geno/pheno set (validation set 
(see previous section)), on the population genotype- 
phenotype data of the clinical isolates that were used for 
the clonal database (population seen data), and on popula- 
tion genotype-phenotype data of 171 clinical isolates from 
RAL treated and INI naive patients, that were not used for 
the clonal database (population unseen data). This unseen 
test set contained clonal genotypes from the three resis- 
tance pathways: 143, 148, and 155. We analyzed the per- 
formance on population data (seen/unseen) separately for 
clinical isolates with/without mixtures that contain one or 
more mutations from the second or first order linear 
regression model (a mixture is defined as an ambiguous 
sequencing result at a given amino acid position). To 
predict the phenotype for isolates containing mixtures, we 
used equal frequencies for all variants [10]. We also calcu- 
lated the R 2 performance on the clinical isolates with mix- 
tures after removal of outlying samples (having a 
studentized residual larger than 2 in absolute value). To 
compare the performance of first and second order models, 
we used the Hotelling- Williams test [25]. 

We also used the exact binomial test to calculate the 
95% confidence interval for the true mixture frequencies 
from the observed variant frequencies in the clones. We 
used these mixture frequencies to predict the phenotype 
for the population seen dataset. In case of more than 
one mixture in a genotype, we calculated a predicted 
phenotype for all combinations of lower and upper 
bounds for the different mixtures. We then plotted the 
bars of the resulting lowest and highest predicted value. 
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In the population unseen dataset, we evaluated the 
linear model biological cutoff call (Susceptible (< bio- 
logical cutoff) or Resistant (> biological cutoff)) ver- 
sus three public genotypic algorithms: Stanford 6.0.11, 
Rega v8.0.2 (http://regaweb.med.kuleuven.be/) and 
ANRS May 2011 (http://www.hivfrenchresistance.org). 

Results 

IN clonal genotype/phenotype database 

The IN clonal database consisted of 991 clones with 
genotype and phenotype in log FC for RAL. The distri- 
bution of these pheno types is shown in Figure 1. The 
biological cutoff for RAL FC was calculated to be 2.0. 
The calculation was done on 317 clonal viruses with 
'susceptible' genotypic profile and non-outlying pheno- 
type. This biological cutoff is in agreement with earlier 
values calculated from INI naive patient samples [26,27]. 
The following site-directed mutants that were included 
in the clonal database had a mean FC above the bio- 
logical cutoff for RAL: 66K, 721 + 92Q + 157Q, 92Q + 
147G, 92Q+155H, 121Y, 140S + 148H, 143C, 143R, 
148R, 155H and 155S (Figure 2). 

RAL linear regression model developed on clonal 
database 

The methodology to develop an INI regression model 
was tested for RAL. In generation 264, the average fit- 
ness of the 100 GA models reached the goal fitness: 



R 2 of 0.95. GA runs where the goal fitness was not 
reached with less than 500 generations (9.1%) were 
discarded. As a result of stage 1, fifty mutations out 
of 322 IN mutations were retained with prevalence 
above 10% in the GA models (Figure 3). In stage 2, a 
first order and a second order RAL linear regression 
model were generated, having 27 IN mutations in 
common, among which the following primary and se- 
condary RAL product label resistance associated mu- 
tations: 143C/R, 148H/K/R and 155H (primary), and 
74M, 92Q, 97A, 140A/S, 1511 and 230R (secondary). 
IN mutations present in more than 65 (threshold T) of the 
100 GA models were considered for mutation pairs in the 
second order linear regression model. Five mutation pairs 
resulted from the stepwise regression procedure: 4 consist- 
ing of a primary mutation and a secondary mutation: 
143C/R & 97A and 155H & 97 A/1511. One mutation pah- 
selected for the model consisted of two secondary 
mutations: 74M & 1511 (Figure 3). We analyzed the fre- 
quencies of occurrence of the linear model mutations oc- 
curring in first and/or second order linear regression model 
in the Stanford database for 4240 clinical isolates of 
INI-naive (2274 clade B, 1966 clade non-B) and 183 clinical 
isolates of RAL-treated patients (178 clade B, 5 clade 
non-B) (http:/ /hivdb.stanford.edu/cgi-bin/II_Form.cgi) (see 
Additional file 1). R 2 performances of the RAL linear model 
on the training data were 0.96 and 0.97 in first and se- 
cond order, respectively. On the validation dataset the R 2 
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Figure 1 Phenotype distribution within the INI clonal genotype-phenotype training dataset. RAL log FC of 991 clones derived from clinica 
isolates and site-directed mutants. RAL biological cutoff was 0.30 log FC or 2 FC 41 .0% of the clones were found below the biological cutoff and 
classified as (S)usceptible, whereas 59.0% of the clones were found above the biological cutoff and classified as (R)esistant. Censoring was applied 
for high FCs. 
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Figure 2 Phenotypes of wild-type pHXB2D and site-directed mutants. RAL FC of wild-type pHXB2D (4 clones) and 28 site-directed mutants 
(88 clones). At least 2 clones were included for each site-directed mutant. 



performance was 0.79 and 0.80 in first and second order, 
respectively (Table 1). Table 1 also contains the perfor- 
mance on population data, further described in the next 
sections. 

The R 2 performance on the validation data improved 
from 0.80 to 0.91 for the RAL second order linear model 
after removal of three outliers: 148K + 140S, 661 + 92Q 
and 143C + 97A (Figure 4). The first and second outlier 
mutation combination were not present in the clonal 
database. For the third outlier four clones, derived from 
one patient, were present. 

Performance of RAL linear regression model on 
population data (seen) 

The frequencies of the linear model mutations in the 
patient-derived clonal genotypes and in the population ge- 
notypes for the same patients were largely similar (Figure 5). 
However, IN mutation 143C was less frequently observed 
in clones than in the population genotypes, and we made a 
site-directed mutant for this mutation (Figure 2). The fol- 
lowing linear model mutations were not found in any of 
the patients and appeared in the model as a result of the 
included site-directed mutants: 66K, 121Y and 155S 
(Figures 2, 3, 5). The R 2 performance of the first order and 
second order linear model on the population genotypes 
with measured phenotype was 0.90 (Table 1). The R 2 
performance was analyzed separately for samples with/ 
without mixtures containing linear model mutations. The 



percentage of samples without mixtures, as detected by 
population sequencing, was 72.9%. Clonal genotypes were 
more diverse for the group of clinical isolates with one or 
more mixtures containing linear model mutations in their 
population genotype (Table 2). The R performance on 
samples without mixtures was 0.95 in first and second 
order. The R 2 performance on the samples with mixtures 
was 0.73 and 0.71 in first and second order, respectively 
and increased to 0.84 and 0.81 after removal of outliers 
(Table 1 and Figure 6). Although the evaluation with error 
bars shows that the range of the predicted phenotype due 
to mixtures containing linear model mutations can be wide, 
averaging for mixtures resulted overall in a good correlation 
with the measured phenotype (Figure 6B). 

Performance of RAL linear regression model on 
population data (unseen) 

On the unseen data the R 2 performance was 0.76 and 0.78 
for the first and second order model, respectively (Table 1, 
Figure 7A). Eighty-nine percent of the unseen population 
genotypes had no mixtures containing linear model muta- 
tions and had an R 2 performance of 0.79 and 0.81 in first 
and second order, respectively. Using the online prediction 
tool geno2pheno integrase 2.0 (http://integrase.bioinf.mpi- 
inf.mpg.de/index.php), the R 2 performance was 0.75 and 
0.76 on the unseen data and the unseen data without mix- 
tures, respectively. Using the RAL biological cutoff, a resis- 
tance call was made for all of the unseen samples. A 



Van der Borght et al. Virology Journal 2013, 10:8 
http://www.virologyj.eom/content/10/1/8 



Page 6 of 12 



Clonal database: 
322 IN mutations 



I Stage 1 



GA ranking: 
50 IN mutations (frequency > 10%) 




Consensus first order linear model 
32 IN mutations 



27 IN mutations in common 



Consensus second order linear model 
30 IN mutations & 5 mutation pairs 



B 



""""' y i43R 


primary 100 

155H 


secondary 100 

92Q 


secondary 100 

97A 


other ^primary 

143G | I48H 


primary 89 

148R 


primary 88 

148K 


secondary 81 

1511 


other 75 

121Y* 


primary 72 

143C 


74M 




140S 


secondary 4B 

230R 


order 46 

84 L 


secondary A £ 

140A 


other 31 

200L 


other 27 

195C 


other 24 

91T 


other 22 

253N 


secondary 19 

163R 


95T 


155S* 


other 17 

135V 

>0.1-2.0 


°' her 124Q 


25E 


113V 


136Q 


230N 


other 12 

119R 


secondary 1 1 

138A 


>2.0-10.0 






c 


primary 100 

143R 


primary 100 

155H 


secondary 100 

92Q 


secondary 100 

97A 


143R&97A° 


100 

155H &97A 


ofher 90 

143G 


primary 89 

148H 


primary 89 

148R 


primary 88\sticondary 81 

148K | 1511 


155H & 1511 


other 75 

121Y* 


primary 72 

143C 


143C&97A 


secondary 69 

74M 


other 65 


56 

74M & 1511 


secondary 51 

140S 


secondary 48 

230R 


secondary 44 

140A 


ofter 34 

72L 


other 27 

195C 


offier 24 

91T 


other 1 9 

103R 


secondary 1 9 

163R 


other 1 8 

95T 


other 1 8 

155S* 


ofhor 15 

124Q 


25E 


"""" 113V 


°"™' 134E 






other 12 >0.1-2.0 >2.0-10.0 




136Q 


"""" 230N 


119R 





Figure 3 RAL Linear models. (A) In the clonal genotype-phenotype database, fifty mutations were retained with frequency > 10% in the GA 

models. Stepwise regression was used for building a first order linear regression model and a second order linear regression model. (B) First order 
linear regression model. (C) Second order linear regression model. In (B) and (C) primary and secondary mutations are indicated according to the 
drug label (upper left corner) and prevalence in GA models is shown (upper right corner). Colouring is based on the FC contribution of the 
mutations/interaction pairs in the linear regression model: >0.1-2 (green), >2-10 (yellow), >10 (red), introduced as site-directed mutant in the 
model: 66K, 1 21 Y and 155S. 



Table 1 Performance of RAL first/second order linear model 
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N 


FIRST order 


SECOND order 


(p- value) 


train 


ALL 


991 


0.96 


0.97 


0.0019 


validation 


ALL 


67 


0.79 


0.80 


0.3753 


population seen 


ALL 


144 a 


0.90 


0.90 


0.1530 




NOMIX 


105 


0.95 


0.95 


0.3558 




MIX > = 1 


39 


0.73 


0.71 


0.0373 




MIX > = 1 
(no outliers) 


36 


0.84 


0.81 


0.0063 


population unseen 


ALL 


171 


0.76 


0.78 


0.0992 




NOMIX 


153 
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0.81 


0.1358 




MIX > = 1 


18 


0.59 


0.58 


0.7482 




MIX > = 1 
(no outliers) 


16 


0.78 


0.78 


0.8819 



a For 144 out of 153 samples a matched genotype-phenotype was available. 

Performance in R 2 of first order and second order linear model on the clonal geno/pheno training database, on the externally PhenoSense measured IN site- 
directed mutants (validation) and on population data, seen and unseen. After Bonferroni correction for multiple testing no significant differences in R 2 
performance are seen between the first and second order model on the validation dataset or population seen/unseen datasets (p-values listed are unadjusted). 
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Figure 4 Performance of RAL second order model on external validation set. The R performance of the RAL second order model on the 
validation set containing 67 IN site-directed mutants (measured with PhenoSense assay) was 0.80. Three potential outliers were identified: 148K + 
140S, 66I + 92Q and 143C + 97A. After outlier removal the R 2 performance was 0.91 (dotted regression line). In the graph, site-directed mutants are 
classified according to the primary mutation harboured. 



25.0% 



20.0% 



15.0% 



>. 
u 
c 

<D 
3 
IT 
01 

,T 10.0% 



5.0% 



0.0% 



3 population data seen {n= 149) 
■ clonal data (n= 899) 



dlL 



jI 



! * -J E H 
m (M 5 i- 

. S i- £ o 



ii-<d:>d:>Oujo<«oou:i^u: 

ilGsnnoii-^^iDOOnnncococo 

|0>0)0'-T-(V|(MncOl , tt**fl , "»t 



Jul 



dlhlL 



I o z 

n m o 

CO O) CO 



CC - O 

£ 10 2 
m ■,- f 

M 



CC X X 

n in m 

^ in m 

5 S S 

< < ? 

Si o> 10 



Consensus second order linear model 



-| > < -J z 

3; m co o co 

eo co co o in 

1- i- CM CM 



Consensus 
first order 

linear model 
mutations 



Figure 5 Prevalence of RAL linear model parameters in clonal vs. population genotypes. Frequency of RAL linear model parameters in clonal 
and population genotypes retrieved from the same patients. Using Fisher exact test, there are no statistically significant differences between clonal and 
population percentages, with exception of 143C (P-value < 0.001) and 97A&143C (P-value is 0.004) where clonal frequency is lower. 
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Table 2 Diversity of clonal genotypes derived from clinical isolates 



Group (#mixtures in 
population genotype) 


#clinical 
isolates 


Average #clones/ 
clinical isolate 


Average #unique clonal genotypes/ 
clinical isolate 


Percentage of unique clonal 
genotypes 


0 mixtures 


105 


5.8 


2.4 


42.0% 


1 mixture 


26 


5.8 


3.8 


64.9% 


> 2 mixtures 


13 


7.0 


4.5 


64.8% 



Population data seen (144 genotypes, with matched phenotype) divided in groups depending on the number of mixtures containing linear model mutations. 



resistant (R) and susceptible (S) call was given to the sam- 
ples with linear model prediction above and less or equal 
than the biological cutoff, respectively. For the samples with 
a concordant call between ANRS, Rega and Stanford (93% 
of the samples, Figures 7B and 7C), the first and second 
order linear model call were in agreement, with exception 
of one sample (A91A/T, I135V) called resistant by the first 
order linear model. The remaining 7% of samples with dis- 
cordance between the genotypic algorithms are given in 
Figure 7D and Table 3. One third of these discordances 
contained the IN mutation 157Q, called resistant by ANRS 
algorithm but susceptible by the first and second order li- 
near model, Stanford and Rega algorithms. Two samples 
(L74M, V151A/V; T97A) were found to be susceptible by 
the second order model, but resistant by the first order 
model. To be precise, the sample T97A had a second order 
model predicted FC of 2.0, equaling the RAL biological cut- 
off value. Samples containing the secondary mutations 
74M and 97A, were also called intermediate resistant (I) by 
the Rega algorithm. Other discordances found were related 
to the IN mutations 121Y (called resistant by the RAL 



linear model) and 138K (called susceptible by the RAL li- 
near model). 

Discussion 

We developed a methodology for predicting INI suscep- 
tibility, applying linear regression on a clonal genotype- 
phenotype database. Our modeling approach differs 
from most of the other genotypic INI resistance inter- 
pretation systems by providing a quantitative FC predic- 
tion. A particular advantage of our model is that 
predictions can be directly interpreted as a weighted 
sum of mutations and interaction pairs. We have made 
our RAL second order linear regression model available 
as PDF fillable form in Additional file 2 such that it can 
be used for rapid prediction of RAL susceptibility. 

Previously, we described a computationally feasible 
technique for developing parsimonious linear regression 
models on large genotype-phenotype datasets for the 
identification of novel HIV-1 drug resistance associated 
mutations [28]. In this article, as the number of patients 
failing INI treatment was limited, our primary objective 
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Figure 6 Performance of RAL linear regression model on population seen data. (A) R 2 performance of linear model on patient genotypes 
without mixtures was 0.95. (B) Scatterplot of measured FC vs. predicted FC for genotypes containing mixtures from 39 clinical isolates. Error bars 
are drawn to indicate the uncertainty of the second order linear model prediction as the true mixture frequencies are unknown. For the three 
outliers identified, each containing a mixture at a primary mutation in the population genotype, the percentage of clones containing that 
particular primary mutation was: 0% (0/5, 148R), 50% (1/2, 143R) and 100% (2/2, 148R). 
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Figure 7 Performance of RAL linear regression model on population unseen data. Scatterplot of measured FC vs. predicted FC for the first 
and second order linear model on unseen data. Horizontal and vertical axes are drawn through the biological cutoff value of FC = 2.0. (A) The R 2 
performance was 0.76 (first order), 0.78 (second order) and 0.75 (geno2pheno). (B) Samples called resistant by ANRS, Rega and Stanford are 
predicted resistant by the linear model. (C) Samples called susceptible by ANRS, Rega and Stanford are predicted susceptible by the linear model. 
(D) Samples with discordant call between the genotypic algorithms. 



was to develop a methodology for training a linear 
regression model on a relatively small dataset. We 
increased the quality of the correlative genotype- 
phenotype data by taking multiple clones for each of the 
clinical isolates [26], allowing to more accurately model 
the resistance contribution of IN mutations or mutation 
pairs. Moreover, to avoid overfitting, we generated an 
INI model by consensus linear regression modeling, 
using a GA for selection of IN mutations [29,30]. Mul- 
tiple clones taken from the same patient largely con- 
firmed the independence of the RAL resistance 
pathways 143, 148 and 155 [24,31,32]. For one patient, 
previously described in [33], four clones were picked 
containing both 143C and 155H. Mutation 143C was 
found to have a low prevalence in the clonal database. In 
[34] a transition from 143C to 143R was suggested, and 
in our RAL linear model 143R had a larger contribution 
towards resistance than 143C. 143G was another 



resistance associated variant at position 143 selected for 
our linear model, and has been described in [35,36]. 
Obviously, our approach is still limited to detecting re- 
sistance associated mutations or combinations of muta- 
tions with presence in the training dataset. This was in 
part overcome by inclusion of site-directed mutants in 
the analysis, which we consider valuable in improving 
the generalizability of the model. 

We evaluated the performance of the RAL linear 
model on an unseen population dataset. For RAL, the 
additive first order model had an overall equal perfor- 
mance to the second order model, which accounted for 
synergism or antagonism. However, for an individual 
sample (T97A) with secondary mutation 97 A, found in 
absence of a primary mutation, a discordance was seen 
between the first and second order linear models. It was 
scored resistant by the first order model and susceptible 
by the second order model when using a biological 
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Table 3 Samples in the unseen population dataset with discordant call between the genotypic algorithms 







Genotypic algorithms 








Linear models 






ANRS (May 2011) 


Rega v8.0.2 




Stanford 6.0.1 1 


First order 




Second order 




Sample 3 


mutations 


call 


mutations call 


mutations 


call 


mutations 


BCO 
call 


mutations 


BCO 
call 


L741LM 


7*71 7/1 M 

/zi, /mv\ 


C 
J 


74M, 1 56N 


I 


7/1 hA 
/4IV1 


c 
o 


74M 


S 


74M 


S 


L74M, A91 T, T97A/T, 
F121Y 


111 7AKA Q7A 
/ZI, /4IVI, y/A, 

121 Y 


D 
1 \ 


74M, 97A, 
206S 


I 


7AhA 07A 

121 Y 


i 


74M, 91T, 97 A, 119R, 
121 Y 


R 


74M, 91 T, 97A, 119R, 
121 Y 


R 


L74M, K103R, 1135V, 
200 L 


721, 74M 


s 


74M 


I 


74M 


S 


74M, 1 35V, 200L 


R 


74M, 103R 


R 


L74M, V151 A/V 


721, 74M 


s 


74M 


I 


74M, 151 A 


S 


74M, 1 35V, 200L 


R 


74M, 103R 


S 


A91 T, F121Y 


721, 121Y 


R 


206S 


S 


121 Y 


1 


91 T, 119R, 1 21 Y 


R 


91T, 119R, 121Y 


R 


T97A 


72I, 97A 


S 


97A, 206S 


I 


97A 


s 


25E, 97A 


R 


25 E, 97A 


S 


E138E/K 


72I 


S 


138K, 206S 


S 


138K 


1 


135V 


S 




S 


E138E/K 




S 


138K 


S 


138K 


1 


25E 


S 


25E 


S 


E157Q 


72I, 157Q 


R 


156N 


S 


157Q 


s 




S 




S 


E157Q 


72I, 157Q 


R 


206S 


S 


157Q 


s 


135V 


S 




S 


E157Q 


157Q 


R 




S 


157Q 


s 




S 




S 


E157Q 


157Q 


R 


156 N, 206S 


S 


157Q 


s 




S 




S 



a HXB2 reference amino acid is shown. 

The mutations on which the RAL S(usceptible)/I(ntermediate)/R(esistant) call is based, are listed. For the linear models, the call is R(esistant) if FC prediction is 
above the RAL biological cutoff (BCO = 2.0). 



cutoff of 2. In two other samples (T97A/T, Y143R; 
E92E/Q, T97A/T, N155H) where primary mutations 
143R or 155H occurred together with 97A (in mixture 
with wild type), the increased resistance conferred by 
the combinations 143C/R & 97A [37] or 155H & 97A, 
was in the second order model accounted for by inter- 
action terms. Because the second order model explicitly 
includes combination effects, we consider it more useful 
than the first order model. All interaction terms in the 
second order model were found to be synergistic. A high 
concordance in RAL resistance call was seen between 
the linear model and the publically available genotypic 
algorithms: Stanford, Rega and ANRS. However, major 
discordances were observed for samples without a pri- 
mary mutation and containing mutation 157Q or 121Y. 
For the discordance involving 157Q, already discussed in 
[38], four clinical isolates (E157Q) from different 
patients were called Susceptible by the linear model, 
Stanford and Rega, but Resistant by ANRS. For the dis- 
cordance involving 12 1Y, one clinical isolate (A91T, 
F121Y) was called Resistant by the linear model and 
ANRS, Intermediate resistant by Stanford, but Suscep- 
tible by Rega. According to [11], the in vivo selection of 
121Y has not yet been reported. In the current study, 
one patient was found in the unseen dataset, who had 
indeed developed the 121Y mutation. However, as 121Y 
was not observed in any of the patient derived clones for 
training of the linear model, we had made seven site- 
directed mutant clones for the clonal genotype- 
phenotype database, confirming the in vitro effect of 



121Y [7] on RAL resistance. As a result, 121Y could be 
and was selected for the linear model, and contributed 
to the FC prediction of the two clinical isolates from the 
aforementioned patient. Note that in the genotype of 
these isolates also the rare mutation 91T was found, a 
mutation that has not been associated with RAL resis- 
tance, but contributed to resistance in the RAL linear 
model. From the unseen data, it seems as if 91T may be 
a background mutation that is currently overweighted in 
the linear model. However, more samples are needed to 
be conclusive about 91T. 

Other rare mutations in the RAL linear model that 
needed to be inspected more carefully were 72L and 
84L, as they are currently undescribed and contributed 
to resistance in the second and first order model, re- 
spectively. Remarkably, 72L and 84L co-occurred in the 
clonal genotypes of nine clinical isolates derived from a 
single patient (only 72L appeared in another clinical iso- 
late, by itself). In the clones of this patient the secondary 
mutations 74M, 92Q and 1511 were also found, in ab- 
sence of any primary mutations, and the measured RAL 
FCs were above the biological cutoff (42.9-77.4). Thus, 
although 72L and/or 84L are potential RAL resistance 
associated mutations, it may be possible that resistance 
for this patient is explained by a more complex synergis- 
tic interaction between 74M, 92Q and 1511. Note that 
mutation pair 74M & 1511 had been selected for the 
RAL second order linear model, which already indicates 
that INI resistance can be developed between interacting 
secondary mutations, in absence of a primary mutation. 
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Moreover, interactions between mutations are expected 
to become more important in elucidating genotype-INI 
susceptibility phenotype relationships once several INIs 
will be co-administered. 

When comparing the R 2 performance of the RAL li- 
near model on population data, unseen vs. seen, a lower 
R 2 performance on unseen data was observed. This dif- 
ference in performance was acceptable as in the unseen 
dataset there were more clinical isolates that did not 
contain any of the primary RAL resistance mutations in 
their genotype (82.5% vs. 45.0%), and the measurement 
error of the phenotypic assay was relatively larger for 
low FC values. 

In the described approach, ordinary least squares regres- 
sion (OLS) was used without taking into account the cor- 
relation between genotypes-phenotypes of clones from the 
same clinical isolate or site-directed mutant. One way to ac- 
count for such correlation would be to replace OLS by a 
linear mixed model with as fixed effects the linear model 
mutations and mutation pairs as in the RAL second order 
linear model (Figure 3), and with the clinical isolate/site- 
directed mutant as random factor. The predictive perfor- 
mance of the resulting model in terms of R 2 changed from 
0.80 to 0.82 and from 0.78 to 0.79, on the external vali- 
dation set, and population unseen dataset, respectively. 
Such a minor change was not unexpected since OLS pa- 
rameter estimates are known to be unbiased, even when 
the correlation structure is neglected [39] . Nevertheless, for 
future work it could be beneficial in using a mixed model 
instead of OLS for the GA models to improve the selection 
of the mutations and mutation pairs. 

In conclusion, RAL resistance could be estimated using 
linear regression modeling and produced results that were 
generally consistent with those observed for samples ana- 
lyzed by Stanford, Rega and ANRS algorithms or the online 
prediction tool geno2pheno. The quality of the INI suscep- 
tibility models is improved by developing the models on a 
clonal genotype-phenotype database and using a GA con- 
sensus approach. A quantitative linear model predicted 
phenotype is interpretable and informative about the effect 
of combinations of mutations on INI resistance. The linear 
regression modeling approach allows generating reliable 
models for INIs once viral isolates have been obtained du- 
ring or after selective pressure of these INIs, even for rela- 
tively small numbers of patients. 
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